conda env create -f ./conda_environment.yaml to create this environment and re-open jupyter inside the environment if you haven't already.fasttree, or with iqtree --fastA jupy notebook like this, is your labjournal for doing research on the computer. In here you keep a record of
You do this just by writing the code and keeping the output saved in here. However, one thing is not kept automatically, and that is the choices you make. Hence for transparent and propper science, it is vital that you make this notebook your own, by writing all observations, desicions etcetera in here.
Describe how you got the sequences you're making a tree of, why you got those sequences there, and what question you are trying to answer by making this tree.
Compared to the Basal v2 workflow, this is changing:
We suspect that the variety of M only, and MIKCc and MIKC sequences is messing up our tree inference. Hence this workflow aims to infer a tree of only MIKCc and MIKC sequences. That should also be the case for the root of the tree, algal sequences. In a separate notebook I collected algal sequences which at sight seem to have at least the MIK domains.
I'm choosing here to remove short sequences after aligning for the literature review I would need to do up front, or working with lengh only, is not satisfactory.
Extra outgroup sequences:https://www.uniprot.org/uniprot/C4IGU9 https://www.uniprot.org/uniprot/C4IGU6
And the paper where I got these: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC548990/
Just FYI, a screenshot of the domains from the basalv2 workflow:
Store the sequences you want to make a tree of in the data directory and make the inseq variable the name of your input fasta without the extention:
inseq=MIKC_orthogroup
Let's look at the first ten lines of your fasta to double-check.
head data/$inseq.fasta
Check the nr. of sequences in your fasta file:
grep '>' data/$inseq.fasta -c
I'm taking the sequences of selection basalv2 and the selection-agal
cat data/MIKC_orthogroup_selection-algae.fasta data/MIKC_orthogroup_selection-basal-v2.fasta > data/MIKC_orthogroup_selection-basal-v3.fasta
If you have versions of your sequence of special interest, or functionally verified ones, be sure to add them!
I imagine you have your guide sequences named something like data/guide_sequences_v1.fasta Combine the two files like so and update the $inseq variable with the new name if you are done in this section.
# for the selection workflow
inseq=MIKC_orthogroup
cat data/outgroup_algae.fasta \
data/Zhang-2019-fig4_MIKC_Azfi-gymnosperms_selection-v3.faa \
data/Azfi-v1_MIKC_manual.faa \
data/azolla_filiculoides/filiculoides_fernbase_blast_results.fasta \
data/piceaabies_sequences/picea-abies_MADs_blast_results_uniq.fasta \
data/salvinia_sequences/salivinia_fernbase_blast_results_uniq.fasta \
data/Henriette_guideazfiv2.fasta \
data/"$inseq"_selection-basal-v3.fasta \
> data/"$inseq"_selection-basal-v3_guide-v4.fasta
grep '>' data/"$inseq"_selection-basal-v3_guide-v4.fasta -c
grep '>' data/"$inseq"_selection-basal-v3_guide-v4.fasta
# for the selection workflow
inseq="$inseq"_selection-basal-v3_guide-v4
echo $inseq
head data/$inseq.fasta
grep '>' -c data/$inseq.fasta
Now we have our fasta file with a feasible amount of sequences. Next step is aligning these. While this may seem trivial, the method of aligning can actually influence your end results quite a bit. Roughly speaking there is several alignment algorithms:
Especially for bigger datasets, I prefer mafft for it is simply very fast and gave me good results in the past. But by all means try more ways if you get odd results.
If you have a good idea of what you're doing, and you want to run multiple alignments in a loop and go have lunch, have a look at section 2.2.
If you find mafft options and parameters confusing, and/or you have difficulty making alignments,
then you may tre to use the online service here.
The online MAFFT service does a good job at explaining the parameters and has a nice visualisation as well!
So read the webpage, and choose your options and parameters aided by the explanations in the webpage. When you submit your job, the mafft command issued in the background is actually shown to you! Hence you can copy paste that command here if you'd like. That's especially useful when the server is under high load, in this notebook you may choose to use all threads available on your computer --threads $(nproc).
Using the MSAviewer that you can open after running your alignment on this server, you can even interactivelly trim. From a reproducibility/scaling point of view, this is not ideal, but to get a feeling for what you are doing, it is very usefull. Just make sure you keep a record of what you do, and keep intermediate results with clear names.
I'll start with showing you my go-to approach. First, have a look at the manual.
Next I'll make a directory to store the untrimmed (hence raw) alignments and run the alignment on all available CPU cores.
I like to do this in 'if loops' to prevent re-doing things unnecessarily.
# If jupyter cannot find mafft, but it is installed via conda, try this pragmatic fix:
conda activate phylogenetics
choosing einsi method now, for it is used with multiple conserved domains where linsi focusses on a single conserved domain.
if [ ! -d ./data/alignments_raw/ ]
then mkdir ./data/alignments_raw
fi
if [ ! -f "./data/alignments_raw/$inseq"_aligned-mafft-linsi.fasta ]
then einsi --thread $(nproc) data/$inseq.fasta > ./data/alignments_raw/"$inseq"_aligned-mafft-einsi.fasta
fi
ls ./data/alignments_raw
head ./data/alignments_raw/"$inseq"_aligned-mafft-einsi.fasta
First I remove collumns with hardly any info.
if [ ! -d data/alignments_trimmed ]
then mkdir data/alignments_trimmed
fi
# define appendix only once here:
trimappendix='trim-gt1'
for a in "data/alignments_raw/$inseq"_aligned*.fasta
do appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
if [ ! -f data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta ]
then echo "trimming alignment $a"
sed -i 's/ /_/g' $a
trimal -in $a \
-out data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".fasta \
-gt .1 \
-htmlout data/alignments_trimmed/"$inseq"_"$appendix"_"$trimappendix".html
fi
done
That already gives me the M, I and K domains. Now I'll just manually remove everything that doesn't have a K domain or looks suspicious too me.
ls data/alignments_trimmed/$inseq*.fasta
Now This is how it looks trimmed:
Removed 47 sequences by hand (in seqview) for they lacked major parts of the MADS DNA binding domain.
Here we'll make fast trees: not acurate, no bootstraps, but fast. This gives us an idea of the output and how we will process it. Building 'propper' trees can take days sometimes weeks, so it's better to be sure you have all sequences in there you want before you start.
I use two ways to make thise fast trees, first with a program called fasttree and second with the programm iqtree with the -fast parameter. My gut feeling is that the latter is a bit more acurate but takes a couple of minutes. Fasttree takes seconds.
I arbitrarily consider trees to be analyses and not data, hence I store these in the analyses directory.
Since these trees run fast (just take a second to consider how rediculous that sounds) I propose to run these in loops again, taking all the trimmed alignments that were made earlier. The trees run in parallel on one CPU. If you're running many trees (way more than you have computing cores) then don't run these in the background. Practically, that means removing the & character almost at the end of the loop.
for a in data/alignments_trimmed/"$inseq"_aligned*.fasta
do echo "making a fasttree of file $a"
appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
echo $appendix
if [ ! -d analyses/"$inseq"_fasttrees/"$appendix" ]
then mkdir -p analyses/"$inseq"_fasttrees/"$appendix"
fi
prefix=analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_fasttree
if [ ! -f "$prefix".tree ]
then nice fasttree -log "$prefix".log \
$a \
> "$prefix".tree \
2> "$prefix".stderr &
fi
done
wait
tail analyses/"$inseq"_fasttrees/*/*fasttree.stderr
tree analyses/"$inseq"_fasttrees/
And here is the same but for running iqtree. I picked some random model here, but substitute it by anything you like better or have good experience with it the past.
for a in data/alignments_trimmed/"$inseq"_aligned*.fasta
do echo "making a iqtree fast tree of file $a"
appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
echo $appendix
if [ ! -d analyses/"$inseq"_fasttrees/"$appendix" ]
then mkdir -p analyses/"$inseq"_fasttrees/"$appendix"
fi
iqprefix=analyses/"$inseq"_fasttrees/"$appendix"/"$inseq"_"$appendix"_iqtree-fast
if [ ! -f "$iqprefix".iqtree ]
then nice iqtree -s $a -fast \
-m 'LG+R7' \
-pre "$iqprefix" \
> "$iqprefix".stdout \
2> "$iqprefix".stderr &
fi
done
wait
To visualise your trees, you perhaps already have something installed like mega, seaview, etc. Otherwise you can upload the tree file to iToL (my prefered method) or any other website that visualises trees. See section 6 for uploading your trees to iToL.
Alternativelly, we can try to get a quick snapshot here in the notebook:
You may want to run a quick comparison table like so:
ete3 compare --unrooted -t ./analyses/MYBs-nina_fasttrees/*/*fasttree.tree -r ./analyses/MYBs-nina_fasttrees/*/*iqtree-fast.treefile
Collumns abbreviations:
Big numbers or fractions are usually not a good sign, but can happen. Just be extra carefull before drawing any conclusions.
Finally, we're at the stage to build propper maximum likelyhood phylogenetic trees! Based on your previous results, you should have one or two trimmed alignments you want to make a tree of. There is several choices to make still: a model of evolution and a bootstrapping method.
modelfinder
IQtree is a state-of-the art tree buildling program, which has a model finder algorithm included! This can take a couple of hours, so be sure to do this only once. There is two model finder options, a quick one with some often used models: -m TEST or an extended modelfinder, using more models of evolution and substitution: -m MFP. I recommend the latter. Once you have your best-fit model (for example: 'LG+R7') then use this model when you build more trees from the same alignment: -m 'LG+R7'
bootstrapping
Normal or 'non-parametric' bootstrapping can take quite a long time; I have had trees running for weeks. Hence there is alternatives that are a lot faster but might over or underestimate the bootstrap values if your alignment doesn't fit your model well. To use 'normal bootstraps' the minimum is 100. That's why I like to to 200 to be safe, by adding the option -b 200.
Alternativelly, there is the 'ultrafast bootstrap' option in IQtree. The minumum for this is 1000 bootstraps, so I'd like to do double by including the parameter: -bb 2000. Additionally, I highly recommend also running the approximate likelyhood ratio test for 2000 bootstraps at the same time by including parameters -alrt 2000. This adds a minimal amount of run time and makes interpretation of your tree a lot more reliable.
As the IQtree FAQ says: typically you start believing a clade when the ultra fast bootstraps => 95 and alrt => 80. Interpretation of these values is not linear like 'normal' bootstrap, hence if you lower the threshold of ultrafast bootstraps to 90, you will likely enormously overestimate your results.
other command-line options
In the commandline I wrote below, I instruct iqtree to use no more CPU cores than your computer has, but also to find the optimum amount of cores (more is not always better). Second, a prefix is defined to store the different trees that IQtree wil make.
More info
Now these are all trimmed alignments you have available. Choose one to start with (based on your fasttrees or inspections of your alignments).
Make sure that
$a inseq=MIKC_orthogroup_selection-basal-v3_guide-v4
ls data/alignments_trimmed/"$inseq"_aligned*fasta
conda activate phylogenetics
a=data/alignments_trimmed/MIKC_orthogroup_selection-basal-v3_guide-v4_aligned-mafft-einsi_trim-gt1-seqrmmanual.fasta
#iqpendix='iqtree-b200'
iqpendix='iqtree-MFP-bb2000-alrt2000'
echo "making a tree of file $a"
echo "The first lines of alignment $a look like this"
head $a
file_appendix=$(echo $a | cut -d '/' -f 3- | sed "s/$inseq\_//" | sed "s/.fasta//")
echo "Making a directory $file_appendix to store trees (name based on alignment filename)"
if [ ! -d analyses/"$inseq"_trees/"$file_appendix" ]
then mkdir -p analyses/"$inseq"_trees/"$file_appendix"
fi
iqprefix=analyses/"$inseq"_trees/"$file_appendix"/"$inseq"_"$file_appendix"_"$iqpendix"
if [ ! -f analyses/"$inseq"_trees/"$file_appendix"/"$inseq"_"$file_appendix"_iqtree-bb1000.tree ]
then nice iqtree -s $a \
-m 'MFP' \
-bb 2000 \
-alrt 2000 \
-nt AUTO \
-ntmax $(nproc) \
-pre "$iqprefix" \
2> "$iqprefix".stderr \
> "$iqprefix".stdout && cat "$iqprefix".log | mail -s IQtree_run laura &
fi
You can have a look at the last lines of your log file like this:
tail -n 30 $iqprefix.log
grep 'START BOOTSTRAP' $iqprefix.log -c
grep 'START BOOTSTRAP' $iqprefix.log
grep 'Total CPU time used' $iqprefix.log -c
grep 'Total CPU time used' $iqprefix.log
Are you content with your tree? Great news! If you want to do another run, I recommend copying the cell above and editing the copy. That way you keep the code for all trees you made. Don't forget to explain what you observed, why you're making a new tree, and what you're changing (remember this is your labjournal).
For tree storage and sharing, I have yet to encounter a better tool than EMBLs iToL. It's a great interface for exploring and sharing trees with colleagues. You can browse to the treefile IQtree created on your computer and upload it to iToL. Alternativelly, you can copy paste the contents of the file to iToL. Make sure to keep the original filename as well! This file name now contains a brief summary of how this tree was made.
The tree of the alignment I send before (basal-v3) finished running now. It is online here: https://itol.embl.de/tree/13121159166346421593419936
The new algal outgroup nicely clusters together except for a few spurious oddities. Inside the algal outgroup, there is arabidopsis AGL33, which seems like MADSbox only protein in my alignment. I haven't been too strict on the guide sequences, but this one will have to go out if we're only comparing MIKC genes. Working our way through the tree starting at the algae clade, there is a big polyphyletic clade containing few Azolla sequences. This clade contains MpMADS1, AGL 30 65 66 67 94 104, and several Selaginella MADS sequences: 2 4 and 10. Then come the homeotic B genes in a clade containing Pi AP3 SVP. Then finally, there is the clade of flowering genes (now containing FLC) and close to that our sequence of interest in the CRM7 clade. There is a paper copy on your desk if you'd like on which I scrabbled a bit, but other than that it is no different than the online version.
There is now two new trees running. First I started a 100 normal bootstrap tree on an alignment like this, however, I removed all Azfiv2 sequences, and Azfiv1 and Salvinia sequences which had no clear K domain. This should be finished sometime tomorrow evening. Normal bootstraping is feasible for an alignment like this, and it's just a lot safer than the fancy methods which may overestimate support values. Especially when sequences have diverged a lot. Hopefully, the difference is marginal. In any case, the support estimations will be more accurate.
Second, I think it is worth trying a stricter trimming on this. Topology wise, we have reasonable monophyletic clades on the macro level. Yet these clades don't very nicely reflect speciations patterns one would expect. I suspect that gappy regions may mess up the speciation patterns and artificially add to branch lengths. I'll trim with gap threshold 50% and run a tree on UFbootstraps+SHALRT so we should now tomorrow morning.
Some other annoyances: The only gymnosperm, Picea abies, actually behaves oddly at times. Ideally, I would remove it and replace it by a few others, but who has the time and will it add that much... Perhaps it is worth adding just one other gymnosperm, to see if they cluster together. I have two angiosperms, rice and arabidopsis. Even in the odd places, they cluster together which gives me some extra confidence they are placed correctly. The algal root perhaps is a bit overkill, we could do with a couple less.